Abstract - Convolutional Neural Networks (CNNs) are powerful
models that achieve impressive results for image classification.
In addition, pre-trained CNNs are also useful for other computer
vision tasks as generic feature extractors [1]. This paper aims to
gain insight into the feature aspect of CNNs and demonstrate
other uses of CNN features. Our results show that CNN feature
maps can be used with Random Forests and SVMs to yield
classification results that outperform the original CNN. A CNN
that is less than optimal (e.g. not fully trained or overfitting)
can also extract features for Random Forest/SVM that yield
competitive classification accuracy. In contrast to the literature,
which uses the top-layer activations as the feature representation
of images for other tasks [1], using lower-layer features can yield
better results for classification.

I. Introduction

Convolutional Neural Networks (CNNs) have proven to be
very successful frameworks for image recognition. In the past
few years, variants of CNN models have achieved increasingly
better performance on the renowned ImageNet dataset for object
classification, starting from AlexNet [2], followed by OverFeat,
GoogLeNet, and a recent model with classification accuracy
surpassing human-level performance. Nevertheless, there are
still many aspects of CNNs that researchers are striving to
understand.

In recent years, research that seeks to gain insights into
CNN models has included exploring new non-linear activation
functions, new training techniques, optimal network
configurations, etc. For instance, prior work has explored the
ReLU activation function, which controls sparsity and helps speed
up training; training techniques that reduce overfitting; and a
reduced dimensionality of the CNN output layer that still yields
good performance, to name but a few.

Collectively, this research helps increase the understanding
and consequently the performance of CNNs. Our research aims
to understand another aspect of CNNs - the feature maps.

The idea of exploring CNN features is also motivated by
their usefulness on a wide variety of tasks. As introduced
earlier, the activations output by CNN layers can be interpreted
as visual features. CNN models trained for classification have
been used as feature extractors by removing the output layer
(which outputs class scores). In AlexNet, this computes a
4096-dimensional vector for each input image (the CNN codes).
In particular, a CNN pre-trained on the ImageNet dataset can be
used as a generic feature extractor for other datasets [1].
Features extracted from pre-trained CNNs such as OverFeat have
been successfully used in computer vision tasks such as scene
recognition and object attribute detection, and achieve better
results than handcrafted features [1]. Given the usefulness of
CNN features, our research aims to further assess the features
and demonstrate how we can use them for other tasks.

II. Methodology and Results 
A. Model Setup 

1) Dataset: In our experiments, we use the plankton data 
provided by Oregon State University Hatfield Marine Science 
Center. 

We use 30,300 labelled samples from the train data, 
which comprises 121 classes. 




Fig. 1. Sample images of 3 classes with 4 samples shown for each class 

2) CNN Models: For our experiments, we use the following
CNN architectures.

TABLE I. Architecture of CNN 1

Layer  Layer Type              Size              Output Shape
1      Convolution + ReLU      32 5x5 filters
1      Max Pooling             2x2, stride 2     (32,12,12)
2      Convolution + ReLU      48 5x5 filters
2      Max Pooling             2x2, stride 2     (48,4,4)
3      Convolution + ReLU      64 5x5 filters
3      Max Pooling             2x2, stride 2     (64,1,1)
4      Fully Connected + ReLU  121 hidden units  121
5      Softmax                 121 way           121

TABLE II. Architecture of CNN 2

Layer  Layer Type              Size              Output Shape
1      Convolution + ReLU      32 5x5 filters    (32,36,36)
2      Convolution + ReLU      32 5x5 filters
2      Max Pooling             2x2, stride 2     (32,16,16)
3      Convolution + ReLU      48 5x5 filters    (48,12,12)
4      Convolution + ReLU      48 5x5 filters
4      Max Pooling             2x2, stride 2     (48,4,4)
5      Convolution + ReLU      64 3x3 filters
5      Max Pooling             2x2, stride 2     (64,1,1)
6      Fully Connected + ReLU  121 hidden units  121
7      Softmax                 121 way           121

TABLE III. Architecture of CNN 3: Dropout used at first layer

Layer  Layer Type              Size              Output Shape
1      Convolution + Maxout    48 8x8 filters
1      Max Pooling             4x4, stride 2     (48,10,10)
2      Convolution + Maxout    48 8x8 filters
2      Max Pooling             4x4, stride 2     (48,4,4)
3      Convolution + Maxout    24 5x5 filters
3      Max Pooling             2x2, stride 2     (24,3,3)
4      Softmax                 121 way           121

We use ReLU as the activation function in CNN1 and CNN2. It is
a popular choice, especially for deep networks, and has been
shown to speed up training.

We note that the sample images vary in scale and are not
necessarily square. To use these images with CNNs, we rescale
them to 28 x 28 pixels for CNN1 and CNN3 and to 40 x 40
pixels for CNN2. This is because networks with more layers
(such as CNN2) generally need a larger input size, since each
pooling layer shrinks its input and the reduction compounds
exponentially with depth.
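
The relation between depth and input size can be checked with a
little shape arithmetic. The sketch below is illustrative only: it
assumes unpadded ("valid") convolutions with stride 1 and unpadded
pooling, which the paper does not state explicitly but which
reproduces the output shapes listed for CNN2 in Table II. The
function and variable names are hypothetical.

```python
def conv_out(size, kernel):
    """Spatial size after an unpadded, stride-1 convolution (assumed)."""
    return size - kernel + 1


def pool_out(size, window, stride):
    """Spatial size after unpadded max pooling (assumed)."""
    return (size - window) // stride + 1


# CNN2 from Table II as (kind, channels, kernel or window, stride).
CNN2_LAYERS = [
    ("conv", 32, 5, 1),
    ("conv", 32, 5, 1),
    ("pool", 32, 2, 2),
    ("conv", 48, 5, 1),
    ("conv", 48, 5, 1),
    ("pool", 48, 2, 2),
    ("conv", 64, 3, 1),
    ("pool", 64, 2, 2),
]


def trace_shapes(input_size, layers):
    """Return the (channels, height, width) shape after every layer."""
    shapes = []
    size = input_size
    for kind, channels, k, stride in layers:
        if kind == "conv":
            size = conv_out(size, k)
        else:
            size = pool_out(size, k, stride)
        shapes.append((channels, size, size))
    return shapes


shapes = trace_shapes(40, CNN2_LAYERS)
print(shapes[-1])  # (64, 1, 1)
```

Under the same assumptions, feeding a 28 x 28 input through this
stack shrinks the feature map to 1 x 1 before the last 3x3
convolution, leaving no room for that filter, which is consistent
with the larger 40 x 40 input chosen for CNN2.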

3) CNN Training: We follow recommended training procedures
from the CNN literature. We use cross-entropy loss as the
objective function to minimize. We use mini-batch stochastic
gradient descent with momentum, which has been shown to be an
effective method for training CNNs. We also use the max-norm
constraint approach to regularize weights. We split the samples
into training, validation, and test sets of size 25000, 1500,
and 3800 respectively.
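
As a rough illustration of the two ingredients named above, the
following sketch applies one momentum update followed by a max-norm
projection to a single weight vector. The learning rate, momentum
coefficient, and norm bound c are placeholder values chosen for the
example; the paper does not report its hyperparameters, and the
function names are hypothetical.

```python
import math


def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update; returns (new_weights, new_velocity)."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def max_norm(weights, c=2.0):
    """Rescale the weight vector so that its L2 norm is at most c."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= c:
        return list(weights)
    return [w * c / norm for w in weights]


# Toy weight vector and gradient (illustrative values only).
w, v = [3.0, 4.0], [0.0, 0.0]
w, v = sgd_momentum_step(w, [1.0, -1.0], v, lr=0.1, momentum=0.9)
w = max_norm(w, c=2.0)  # projected back onto the ball of radius c
```

In a network, the max-norm projection would typically be applied to
the incoming weight vector of each unit after every mini-batch
update.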

B. CNN Features for Classification 

As mentioned before, CNN models such as OverFeat,
AlexNet, and GoogLeNet that are pre-trained on ImageNet can
be used as generic feature extractors for other tasks. This is
done by removing the top output layer and using the activations
from the last fully connected layer (CNN codes) as features. [1]
uses pre-trained OverFeat on other datasets to extract features
and uses these features on other computer vision tasks. It turns
out that this off-the-shelf feature extractor gives features that
yield better results than handcrafted features [1].

As opposed to using CNN features on other tasks, we
are interested in using them for the original classification
problem. To do this, we take features from our CNN trained
on the plankton dataset and use them as training input for
other classification methods (Random Forests and SVM). We
are curious to see how the performance will compare to the
baseline CNN classification accuracy.

In this research, we also take activations from other layers,
in addition to the last fully connected layer, as the feature
representation of images. This is not a conventional approach
in the literature [1]. We are interested to see if the
lower-layer features are more suitable for classification with
other algorithms.

1) SVM and Random Forest on CNN Features:
First, we train CNN1, described in Table I. We pass the training
and validation samples to the CNN and take the layer activations
as training input for SVM and Random Forest. These are the same
samples used to train the CNNs. We use the feature maps of each
layer in CNN1 as training input for 3 classification models,
namely Random Forest, SVM (one-vs-all) and SVM (one-vs-one).
SVM one-vs-all trains n classifiers for n = 121 classes. SVM
one-vs-one trains $\binom{n}{2} = n(n-1)/2$ classifiers.
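
To make the difference in cost concrete, the short sketch below
counts the binary classifiers each multiclass scheme trains for the
121 plankton classes. The function name is hypothetical; the counts
follow directly from the definitions above.

```python
import math


def num_binary_classifiers(n_classes):
    """Return (one-vs-all count, one-vs-one count) for n_classes."""
    one_vs_all = n_classes                # one classifier per class
    one_vs_one = math.comb(n_classes, 2)  # one classifier per class pair
    return one_vs_all, one_vs_one


ova, ovo = num_binary_classifiers(121)
print(ova, ovo)  # 121 7260
```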



Fig. 2. Test Set Prediction Accuracy of Random Forest, SVM and baseline
CNN1 (x-axis: Layer; series: CNN Baseline, Random Forest, SVM 1v1,
SVM 1vAll)

The classification results are shown in Figure 2. In this
figure, layer 0 represents using raw data (as opposed to using
CNN features) as training input. This is shown as a comparison
for each classification model to see how well it performs
without using features from the CNN. The classification accuracy
of CNN1 is shown as the baseline.

A few points to highlight based on these results:

• Random Forest and SVM trained on CNN features can
perform better than the baseline CNN. This is quite
surprising since the original CNN trains the weights
which specify feature extraction.

• We note that the main difference between the original
CNN and Random Forest/SVM on CNN features is
that the original CNN feeds the last convolutional
features into the trained neural network (fully connected
+ output layer). This is shown in the CNN architecture
in Table I. As opposed to using the trained neural
network for prediction, Random Forest and SVM
perform their own training on fixed CNN features.
The features also come from many layers and are not
limited to the last convolutional layer.

• Generally, higher layers seem to extract better features,
as indicated by the increasing performance of SVM
and Random Forest from layer 1 to layer 4. However,
the highest accuracy is achieved using layer 3
activations with SVM one-vs-one. This supports our
hypothesis that high-level features might not necessarily
be better than low-level features. We note that layer 3
corresponds to the last convolutional layer, whereas layer
4 corresponds to the fully connected layer (see Table I
for more details).
































We also ran the same experiment with CNN2 (see Table II
for details). The results shown in Figure 3 confirm a
similar trend of increasing prediction accuracy with
higher-level features. However, the best prediction accuracy is
achieved with SVM one-vs-one using activations from layer 4
(one layer before the last convolutional layer). We also observe
that for SVM one-vs-all and Random Forest, the best accuracy
is achieved using layer 5, the last convolutional layer. This
adds to the evidence that the last fully connected features might
not be optimal as training input for some classification
models.


Fig. 3. Test Set Prediction Accuracy of Random Forest, SVM and baseline
CNN2 (plot title: Prediction Accuracy for CNN2; series: CNN Baseline,
Random Forest, SVM 1v1, SVM 1vAll)

2) Feature Significance: This section tests for the significance
of features with Random Forest. The test measures the relative
importance of each feature based on how it influences the output
prediction. Features used near the top of a tree contribute to
the final prediction for a larger fraction of input samples.
The output of this feature importance test is a score from 0
to 1 that ranks how important the feature is. For example, if
there are n features and every feature is equally important,
the importance scores are all 1/n. To select significant features,
we impose a threshold t that depends on n, the total number of
features in the test. Figure 4 shows the results for CNN1 and
CNN2. At the first fully connected layer (layer 4 in CNN1 and
layer 6 in CNN2), there is a lower proportion of significant
features. This might help explain why features at the fully
connected layer can yield lower prediction accuracy than features
at the previous convolutional layer.
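
The proportion of significant features in a layer can be computed
from normalized importance scores as sketched below. The exact
threshold used in the paper is not reproduced here; as a stand-in,
the sketch defaults to the uniform baseline 1/n, which is purely an
assumption for illustration, as are the function name and the toy
scores.

```python
def significant_fraction(importances, threshold=None):
    """Fraction of features whose importance exceeds the threshold.

    importances: non-negative scores that sum to 1 (one per feature).
    threshold: cut-off score; defaults to the uniform baseline 1/n,
               which is an assumption, not the paper's stated value.
    """
    n = len(importances)
    if threshold is None:
        threshold = 1.0 / n
    return sum(1 for s in importances if s > threshold) / n


# Toy importance scores (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.2, 0.05, 0.05]

print(significant_fraction(uniform))  # 0.0
print(significant_fraction(skewed))   # 0.25
```

With a Random Forest library that exposes per-feature importances,
the same function could be applied to the scores obtained for each
layer's features.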

3) SVM and Random Forest on Early-Epoch CNN Features: 
Based on earlier results, a CNN with low prediction accuracy 
can give features that yield better classification results with 
Random Forest and SVM. We thus want to see the quality of 
features from a CNN that is not fully trained. To do this, we 
extract features from CNN1 at epochs 0 to 62. Note that an
increment in epoch means an additional pass through the whole 
training data while doing stochastic gradient descent. Epoch 0 
represents an initialized CNN (with randomized weights) that 
has not been trained at all. 

Fig. 4. CNNs Feature Importance (panels: Proportion of significant
features in each layer for CNN1; Proportion of significant features in
each layer for CNN2; axis label: Number of features in each layer)

Fig. 5. Prediction accuracy of CNN, Random Forest and SVM at varying
epochs. Random Forest and SVM use layer 3 activations as training input.
(Plot title: Prediction Accuracy for CNN1; series: CNN Baseline, Random
Forest, SVM 1v1, SVM 1vAll)

Figure 5 shows the result of this experiment. We use CNN1
with features from layer 3 as the training input for SVM and
Random Forest. The trend shows that the CNN generally extracts
better features as the number of epochs increases. However, we
observe that Random Forest and SVM can achieve high accuracy
even at an early epoch (24), while the CNN's accuracy is still
increasing. Perhaps Random Forest and SVM can be used
to indicate an upper bound on the accuracy of the CNN. Another
interesting observation is that Random Forest still performs
quite well on features extracted at epoch 0. These features
are obtained from convolutions whose weights are randomly
initialized.

4) CNN with Bagging Random Forest and SVM:
Bagging is an ensemble method that has been used successfully
with CNNs to achieve better accuracy. However, the performance
gain is usually not dramatic. In addition, each CNN model is
computationally expensive to train, so combining many CNN
models takes non-trivial computational resources. In this
section, we are interested in how the bagging performance
compares to that of SVM and Random Forest trained on
features extracted by one model.

To do this, we train 8 models with the CNN1 configuration.
For each test sample, we obtain class probabilities from all the
trained CNNs. Then, we average the predicted probabilities and
pick the class with the highest probability as the final
prediction. We obtain a performance boost from 0.5263
(one-model accuracy) to 0.55526. However, this is still inferior
to the accuracy of Random Forest or SVM (0.6 or above), as shown
in the CNN1 accuracy plots.
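
The averaging step described above can be sketched as follows for a
single test sample. The probabilities are toy values for three
hypothetical models and three classes, not outputs of the paper's
CNNs, and the function names are illustrative.

```python
def average_probabilities(model_probs):
    """Average per-class probabilities over models for one sample.

    model_probs: list with one class-probability list per model.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [
        sum(probs[c] for probs in model_probs) / n_models
        for c in range(n_classes)
    ]


def ensemble_predict(model_probs):
    """Index of the class with the highest averaged probability."""
    avg = average_probabilities(model_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Three hypothetical models scoring one sample over three classes.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
]
print(ensemble_predict(probs))  # 1
```

Note that the first model alone would predict class 0, while the
averaged prediction is class 1; this is the kind of correction an
ensemble can provide.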

5) CNN with Maxout and Dropout: The reason for such a
large gap between CNN performance and SVM/Random Forest
could be overfitting at the fully connected layer. In
this experiment, we use Dropout, a training technique
akin to model averaging that improves prediction
accuracy by controlling co-adaptation of weights. Maxout
is an activation function that can be used with Dropout
to further improve accuracy. We train the CNN3
model (Table III), which uses Maxout and Dropout. Similar to
the previous section, we also extract the features as training
input for Random Forest and SVM.
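
For readers unfamiliar with the two techniques, the sketch below
shows a Maxout unit (the maximum over k linear pieces) and Dropout
applied to a list of activations. It is a simplified illustration,
not the CNN3 implementation: the group size k, the drop probability,
and the use of "inverted" scaling at training time are all
assumptions, and the function names are hypothetical.

```python
import random


def maxout(pre_activations, k):
    """Maxout: take the maximum over each group of k linear outputs."""
    if len(pre_activations) % k != 0:
        raise ValueError("number of inputs must be a multiple of k")
    return [
        max(pre_activations[i:i + k])
        for i in range(0, len(pre_activations), k)
    ]


def dropout(activations, p_drop, rng):
    """Zero each activation with probability p_drop (training time).

    Kept activations are divided by the keep probability (inverted
    dropout, an assumed convention) so the expected value is unchanged.
    """
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Six linear outputs grouped into three Maxout units of k = 2 pieces.
units = maxout([0.2, -1.0, 0.5, 0.1, 0.3, -0.4], k=2)
print(units)  # [0.2, 0.5, 0.3]

dropped = dropout(units, p_drop=0.5, rng=random.Random(0))
```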

TABLE IV. Accuracy of Random Forest and SVM on last layer
CNN features

Classifier            Accuracy
Baseline CNN          0.64052
Random Forest         0.65526
SVM One versus Rest   0.65368
SVM One versus One    0.57184


Table IV shows that the baseline CNN3 with Maxout and
Dropout has better accuracy (0.64052) than the
original models (0.5263 for CNN1 and 0.5597 for CNN2).
This is not surprising since Maxout and Dropout have been
shown to improve accuracy on many CNN models.
However, we still obtain higher accuracy than the CNN
baseline by using Random Forest and SVM (one-vs-rest) on the
last layer features of CNN3.

We note that a model with dropout takes roughly 3 times
longer to train than the original model. This is also known
in the literature. Random Forest and SVM on CNN1 features
yield accuracy up to 0.6393 (see the CNN1 results above), which
is a competitive result that does not require much additional
computation.

C. CNN Features for Clustering 

In this section, we demonstrate another use of CNN
features, clustering, and qualitatively explain why CNN
features work well. This adds to the evidence that a CNN is a
good feature extractor.

We consider the task of clustering 121 classes of plankton
based on visual similarity. A naive approach to clustering is
to find the centroid of each class in the original image space,
and use a hierarchical clustering algorithm with the Euclidean
distance as the distance metric. This performs poorly, as there
are classes whose samples look different from each other (e.g.
under rotation). Thus the centroid of such a class would be a
blob that gives little information about that class.

We thus propose using the features extracted at a convo¬ 
lutional layer of a CNN for clustering purposes. We extract 
the features at the third layer of CNN1, calculate the centroid
of each class in the feature space, and use a hierarchical 
agglomerative clustering algorithm with the Euclidean distance 
to cluster plankton, forming a dendrogram. 
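
The procedure can be sketched on toy data as follows. The example
uses 2-dimensional features and made-up class names in place of the
64-dimensional layer-3 features, and it uses single linkage because
the paper does not state which linkage criterion was used; both are
assumptions made for illustration. In practice a library routine
such as SciPy's hierarchical clustering would normally replace the
naive loop.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(centroids):
    """Naive single-linkage agglomerative clustering of class centroids.

    centroids: dict mapping class name to its centroid vector.
    Returns the merges as (cluster_a, cluster_b, distance) in merge
    order, which is the information a dendrogram displays.
    """
    clusters = [(name,) for name in centroids]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(centroids[a], centroids[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Toy per-class feature vectors (hypothetical values and class names).
features = {
    "class_a": [[1.0, 0.0], [1.2, 0.2]],
    "class_b": [[1.0, 0.4], [1.4, 0.2]],
    "class_c": [[5.0, 5.0], [5.2, 4.8]],
}
centroids = {name: centroid(vs) for name, vs in features.items()}
merges = agglomerate(centroids)
```

Here the two classes with nearby centroids are merged first, and the
distant class joins last, mirroring how visually similar plankton
classes end up in the same branch of the dendrogram.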




Fig. 6. Part of dendrogram showing similar classes 


We look at a part of the dendrogram in Figure 6 and
give some reasoning as to why the 4 classes shown are
clustered closely together. Figure 7 shows the class centroids
(64-dimensional vectors), which represent the average feature
scores for the respective classes. Based on this figure, the top 3
common scores correspond to features 16, 24, and 40.
Fig. 7. Class Samples and Feature Scores (panels include Class B:
Copepod Calanoid Frilly Antennae; Class C: Copepod Calanoid Flatheads;
Class D: Copepod Calanoid; each shown with its average layer activation)

To show what features 16, 24, and 40 represent, we use
a visualization technique (DeConvNet), which shows the parts
of images that most activate the corresponding features. The
visualization in Figure 8 shows that feature 16 could
represent "rounded blobs", and features 24 and 40 could be "small
tentacles", which are common to these 4 classes. Since the
clustering is based on these CNN features, this helps to
explain why we get good clustering results.

Fig. 8. Visualization of Features

Figure 9 is another part of the dendrogram, which shows
another group of plankton. We can see from Figures 7 and
10 that the two groups, which are far apart in the dendrogram,
have very different feature scores. However, within each group,
the feature scores are strikingly similar. In Figure 10, we can
see that features 5 and 36 have high activations in these 3
classes. Based on feature visualization, it looks like feature 5
could be "membranes", while feature 36 could simply be "porous
body".

Fig. 9. Part of dendrogram showing similar classes

Fig. 10. Classes Samples and Feature Vectors (panels include Class E:
Tunicate Partial; Class F: Hydromedusae Partial Dark)


We believe that this can help in the development of
phenetics, which is a method of classifying organisms
based on their visual similarity. Thus when biologists
use a CNN for image classification, they can also extract the
features and get such a dendrogram "for free", which could
be used as a dichotomous key. We also note that the features
involved in the distance computation can be selected according
to feature visualization (DeConvNet). This would give a different
metric, which could be useful for generating clusters
based on different feature qualities.

III. Conclusion

Our results show that Random Forest and SVM can be used
with features from a CNN to yield better prediction accuracy
than the original CNN. Even if the CNN is not optimal,
e.g. not fully trained or overfitting, it can still extract good
features that give competitive prediction accuracy against more
computationally expensive methods such as model averaging
or a CNN trained with Dropout.

In addition, we found that, in contrast to the practice of
using the features from the last fully connected layer (CNN
codes), using lower-level features can be better, at least
for classification with SVM and Random Forest.

Our qualitative analysis also demonstrates why CNN features
are useful in computer vision tasks. Instead of viewing the
CNN as a black box, the visualization technique DeConvNet
helps explain how similar images have similar CNN features.

IV. Future Work 

We note that there are other CNN architectures that yield
better classification accuracy for this dataset. Future work
includes replicating these CNN architectures and using their
features with Random Forest or SVM. Our research
can also be extended to study pre-trained CNNs such as
AlexNet, GoogLeNet, etc. It would be worthwhile to test
whether these architectures share the trend we found,
namely that CNN features can yield better classification accuracy
than the original CNN. We would also like to see if
the convolutional layers yield better classification accuracy
than the last fully connected layer, which is traditionally
used as the CNN features in the literature.

We can also experiment on other datasets such as CIFAR-10,
CIFAR-100, MNIST, and ImageNet. However, for
large datasets such as ImageNet with over 1 million training
samples, Random Forest might not scale well. In
our experiment, training Random Forest takes a significant
amount of memory with only 25K training samples and 400
trees. In addition, for datasets with a large number of classes
such as ImageNet (1000 classes), the one-vs-one SVM, which
trains $\binom{n}{2}$ classifiers for n classes, might be too
slow as well. However, SVM (one-vs-all), which trains n
classifiers, should be able to handle this.